Abstract

Background: Gastric cancer is the second highest cause of global cancer mortality. To explore the complete 
repertoire of somatic alterations in gastric cancer, we combined massively parallel short read and DNA paired-end 
tag sequencing to present the first whole-genome analysis of two gastric adenocarcinomas, one with 
chromosomal instability and the other with microsatellite instability. 

Results: Integrative analysis and de novo assemblies revealed the architecture of a wild-type KRAS amplification, a
common driver event in gastric cancer. We discovered three distinct mutational signatures in gastric cancer:
against a genome-wide backdrop of oxidative and microsatellite instability-related mutational signatures, we 
identified the first exome-specific mutational signature. Further characterization of the impact of these signatures 
by combining sequencing data from 40 complete gastric cancer exomes and targeted screening of an additional 
94 independent gastric tumors uncovered ACVR2A, RPL22 and LMAN1 as recurrently mutated genes in microsatellite 
instability-positive gastric cancer and PAPPA as a recurrently mutated gene in TP53 wild-type gastric cancer. 

Conclusions: These results highlight how whole-genome cancer sequencing can uncover information relevant to 
tissue-specific carcinogenesis that would otherwise be missed from exome-sequencing data. 



Background 

Gastric cancer (GC) is the fourth most common cancer 
and the second leading cause of cancer death worldwide. 
Early stage GC is often asymptomatic or associated with 
non-specific symptoms, resulting in most patients pre- 
senting at advanced disease stages. Treatment options 
for late-stage GC patients are limited, with surgery and 
chemotherapy regimens offering modest survival bene- 
fits. Environmental risk factors for GC include a high 
salt diet, smoking, and infection by Helicobacter pylori
[1]. Understanding the mutational impact of these envir- 
onmental exposures on the genomes of gastric epithelial 
cells is essential to shed light on specific genes and 
pathways associated with gastric tumorigenesis. 

Previous studies in lung cancer [2,3], melanoma [4], 
and leukemia [5] have shown that environmental carci- 
nogens and drugs can elicit specific somatic mutational 
profiles in cancer genomes, referred to as 'mutational 
signatures'. While previous studies on GC have applied 
exome-sequencing approaches to identify frequently 
mutated genes [6,7], identifying mutational signatures is 
best done using whole-genome data, due to its comple- 
teness and ability to simultaneously uncover micro- and 
macro-scale somatic alterations. In this study, we sought 
to provide a more comprehensive understanding of 
mutational processes in GC by analyzing whole-genome
sequences of two GCs and their matched-normal con- 
trols, using both short-read (SR) next-generation 
sequencing and a long insert (approximately 10 kbp) 
DNA paired-end tag (DNA-PET) protocol [8]. We also 
sought to explore the combination of these datasets for 
de novo assembly of cancer and normal genomes and to 
comprehensively catalogue a range of somatic alterations
in the tumor, from point mutations to megabase-sized events.
Finally, we used this catalogue to characterize the 
impact of mutational processes on genes and used a 
screening approach to validate recurrently mutated 
genes in subtypes of GC defined by specific mutational 
processes. 

Results 

Integrative short read/DNA-PET analysis and de novo 
assembly 

The matched tumor and normal samples analyzed were 
from two Singaporean patients. One GC exhibited evidence
of microsatellite instability (MSI) and active H.
pylori infection (see Table S1 in Additional file 1 for other
clinical characteristics). Each tumor and matched normal
sample was sequenced to more than 30-fold average base 
pair coverage by Illumina SR sequencing (Materials and 
methods; Table S2 in Additional file 1), and to > 130-fold 
physical coverage using large-insert (approximately 10 
kbp) DNA-PET sequencing [9] on the SOLiD platform 
(Materials and methods; Table S3 and Note 1 in Addi- 
tional file 1). Single nucleotide variants (SNVs) and short 
insertions and deletions (indels) from tumor and normal 
genomes were combined to identify somatic variants 
(Table 1 and Materials and methods) and reliability of 
somatic calls was confirmed using targeted sequencing 
(validation rate of 90% for SNVs and 96% for indels; Mate- 
rials and methods). SR and DNA-PET data were also used 
to identify somatic copy-number variations (CNVs) and 
structural variations (SVs) (validation rate = 81%; Materi- 
als and methods; Note 1 in Additional file 1). 
We integrated the SR and DNA-PET sequence informa- 
tion to perform de novo assembly of the tumor and nor- 
mal genomes. While complete de novo assembly of a 
tumor genome still poses significant technical challenges 
and has not been attempted before, we were able to use 
the SR/DNA-PET data to construct highly contiguous 
draft assemblies of median scaffold lengths (N50) in the 
range of 41 to 148 kb, with DNA-PET data assisting in tri- 
pling sequence contiguity of the assemblies (Materials and 
methods; Note 2 and Table S5 in Additional file 1). Impor- 
tantly, performing de novo SR/DNA-PET assembly 
revealed several findings not observed using conventional 
analyses of the SR data. First, the de novo approach 
allowed for characterization of large-scale somatic structural
variations at single base-pair resolution (SR libraries
were unable to identify nearly half of the validated SVs
and fusion genes; Note 1 in Additional file 1). For example,
NGCII092 exhibited a focal genomic amplification on
chromosome 12p11-12 in a region containing the wild-type
KRAS gene, a genomic event frequently observed in
GC [10]. The combined SR/DNA-PET data (Materials and
methods) enabled a detailed putative reconstruction of the
evolutionary lineage of the amplified KRAS locus with
concomitant deletion of a proposed tumor suppressor
gene RASSF8 (as well as another focal amplicon at chromosome
6p) as described in the supplementary text
(Figure 1; Figures S1 and S2 and Note 3 in Additional file
1). Reconstruction of the tumor genomes also allowed the
prediction of fusion genes and complex rearrangements
that resemble patterns created by replication coupled
mechanisms [11] and are further described in the supplementary
text (Note 4 and Figures S3 and S4 in Additional
file 1 and Table S6 in Additional file 2).

Table 1 Somatic variations in two GC tumors identified
by whole genome sequencing approaches

Patient ID                              NGCII082    NGCII092
SNVs, all somatic                         14,856      17,473
  Coding regions                             119         116
  Non-synonymous                              86          73
  Promoter regions                           101         161
Indels, all somatic                       11,738       2,486
  Coding regions                              12           2
CNVs, all somatic                            836      21,776
  Affecting genes                              3         265
SVs, all somatic                              12         146
  Affecting genes                             11          96
  Deletions                                    6          56
  Tandem duplications                          2           8
  Unpaired inversions                          0          26
  Inversions                                   0           2
  Insertions (intra-chromosomal)               0           0
  Insertions (inter-chromosomal)               0           0
  Isolated translocations                      0           3
  Balanced translocations                      0           0
  Complex events (intra-chromosomal)           4          49
  Complex events (inter-chromosomal)           0           2

Second, a combined SR/DNA-PET analysis allowed us 
to assemble sequences present in the tumor genome but 
not in the reference human genome. For example, in 
patient NGCII082 exhibiting active H. pylori infection, 
we detected approximately 2,000 short-sequence reads 
and > 600 DNA-PET tags corresponding to the H. pylori 
genome (the first such report for a bacterial pathogen 
from tumor sequencing), in addition to a tumor-asso- 
ciated microbiome (these were not seen in NGCII092; 
see Figure S5 and Note 5 in Additional file 1 for details). 
Note that, despite being fewer in number, the DNA-PET
tags contributed significantly to the physical coverage
and analysis of the genomes (Figure S5 and Note 5 in
Additional file 1).

Figure 1 Copy number of two gastric cancer genomes, mechanism of 12p amplification and creation of a fusion gene. (a) Somatic
CNVs in the two gastric tumors (chromosomes are arranged on the x-axis, copy number is shown on the y-axis). (b) Copy number of
chromosome 12 (top) and the amplicon on 12p (middle) are shown in orange (y-axis). Rearrangements identified by DNA-PET clusters with a
size > 45 are represented by arrows and connecting lines (bottom). Dark red and pink arrows represent 5' and 3' cluster regions, respectively,
with the connection between the tip of the dark red and the blunt end of the pink arrows. Numbers represent cluster sizes. (c) Fusion between
SOX5 and OVCH1 predicted by a rearrangement point with cluster size of 129 in (b).

Third, the de novo assembly enabled annotation of 
human genes and variants in sequences absent in the 
reference genome. In total, we identified more than 3 Mbp 
of novel sequence (longer than 500 bp), containing several 
genes (including an ortholog to a cytokine receptor-like 
factor - CRLF2), and more than 1,000 somatic and germline
variants for each patient (Materials and methods;
Note 2 and Table S5 in Additional file 1). 
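
The scaffold N50 values quoted for the draft assemblies in this section follow the usual definition: the scaffold length at which scaffolds of that length or longer contain at least half of the assembled bases. The short Python sketch below illustrates that definition only; the function name and the toy scaffold lengths are hypothetical and are not taken from the assemblies in this study.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths (in bp)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first scaffold that pushes the running sum to half of the
        # assembly size defines the N50.
        if running >= half_total:
            return length
    raise ValueError("no scaffold lengths given")


# Hypothetical toy assembly; these lengths are illustrative only.
toy_scaffolds = [80_000, 70_000, 50_000, 40_000, 30_000, 20_000, 10_000]
toy_n50 = n50(toy_scaffolds)
```

Because N50 weights scaffolds by length, it is generally larger than the plain median scaffold length when an assembly contains many short scaffolds.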

Mutational signatures of damage by reactive oxygen 
species, deamination and microsatellite instability 

We characterized mutational signatures in the GC gen- 
omes based on 14,856 somatic SNVs (11,738 indels) in 
NGCII082 and 17,473 somatic SNVs (2,486 indels) in 
NGCII092 that were identified from the whole-genome 
data (Table 1). This accounts for an average mutation fre- 
quency of 5 per megabase and included > 100 SNVs in 
protein coding regions for each tumor (Table 1; Note 6 in 
Additional file 1). Note that we identified more than five 
times the number of somatic variants uncovered in earlier 
sequencing studies [6,7] that were restricted to exomes 
(5,588 SNVs and 2,347 indels identified from 37 exomes), 
highlighting the statistical advantage of whole-genome 
analysis for studying mutational signatures. Overall, 
NGCII082, an MSI-positive tumor, displayed an excess of 
SNVs in protein coding regions (P-value < 0.02, χ² test)
and a striking seven-fold higher frequency of micro-indels 
(Figures 2 and 3d) but a lack of large-scale SVs and ampli- 
fications or deletions (Figure 2 and Table 1). In contrast, 
NGCII092 exhibited a complex copy number profile of 
extensive focal amplifications and deletions, and a mutated 
TP53 gene, consistent with the presence of chromosomal 
instability (CIN) in the tumor genome (Figure 2). These 
results agree with the mutual exclusivity seen in MSI and 
CIN pathways for inducing mutations in other cancers as 
well [12]. 

The clear excess of micro-indels in the MSI-positive GC 
(Figure 3d; Figure S10 in Additional file 1) was character- 
ized by a pattern of single base-pair thymine deletions in 
mononucleotide repeats (79%). In contrast, there were a 
comparable number of insertions in both the MSI-positive 
and CIN-positive GC, and a similar deletion-specific pat- 
tern has also been noted before [13]. Also, non-thymine 
and non-mononucleotide repeat deletions were not found 
to be in excess. The correlation between MSI phenotype 
and the specific deletion signature identified here was 
further confirmed from previous exome-sequencing data 
[7] (four MSI-positive exomes), though this aspect was not 
noted in the previous work. In terms of genomic location, 
the deletions were randomly scattered throughout the 
genome and occurred in proportion to the regional 



presence of thymine mononucleotide repeats (that is, 85% 
of homopolymers > 5 bp). Thus, despite the bias towards 
thymine deletions, there seems to be an absence of a tar- 
geting mechanism on the genome for the MSI-associated 
signature. 
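
To make the notion of thymine mononucleotide repeats concrete, the following Python sketch locates poly(T) tracts longer than 5 bp in a sequence. It is an illustration under our own assumptions rather than the procedure used in this study: the function name and the toy sequence are hypothetical, and only the given strand is scanned (poly(A) tracts, which correspond to poly(T) on the opposite strand, are not reported).

```python
import re


def thymine_runs(sequence, min_length=6):
    """Return (start, length) for each poly(T) tract of at least min_length bases."""
    pattern = re.compile(r"T{%d,}" % min_length)
    return [
        (match.start(), match.end() - match.start())
        for match in pattern.finditer(sequence.upper())
    ]


# Hypothetical toy sequence: one tract of 8 thymines and one of 5 thymines.
toy_sequence = "ACG" + "T" * 8 + "GCA" + "T" * 5 + "AC"
long_tracts = thymine_runs(toy_sequence)           # tracts > 5 bp
all_tracts = thymine_runs(toy_sequence, min_length=5)
```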

Despite exhibiting very different somatic alteration pat- 
terns (MSI or CIN), the mutational frequencies of both 
GCs at the single nucleotide level were highly similar, 
being significantly biased towards C > A and T > A alterations
compared to normal genomes (P-value < 10⁻¹⁶, χ² test;
Figure 3a). These alterations likely represent mutations
caused by reactive oxygen and nitrogen species (ROS
and RNS), which are known to produce C > A and T > A 
mutations [14]. Also, a likely trigger is H. pylori infection, 
which has been shown to cause chronic inflammation and 
ROS/RNS production in gastric epithelial cells [14]. The 
C > A mutations observed were associated with highly significant
sequence-selectivity, being marked by an excess at
CpCpT (NGCII082, odds ratio (OR) = 3.2, P-value < 10⁻¹⁶,
χ² test) or TpCpA sites (NGCII092, OR = 1.7, P-value <
10⁻¹⁶, χ² test) and extensions of these motifs (Materials
and methods; Note 6 and Figure S6 in Additional file 1 
and Table S14 in Additional file 6). This pattern is distinct 
from the C > A signature seen in smoking-associated 
small-cell lung cancer where an excess was seen in CpG 
dinucleotides outside CpG islands, suggesting a link with 
methylation status [2,3]. Further work is required to iden- 
tify the mechanistic basis of sequence selectivity in this 
genome-wide GC-specific signature. 
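
As an illustration of how an odds ratio and χ² test of this kind can be obtained, the Python sketch below evaluates a 2 x 2 table of somatic versus germline C > A SNVs falling at a motif or elsewhere. The table layout is our assumption about the comparison, the counts are hypothetical, and the χ² statistic is computed without continuity correction; the sketch is not the analysis code used in this study.

```python
from math import erfc, sqrt


def odds_ratio(a, b, c, d):
    """Odds ratio for the table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 degree of freedom, no continuity correction)."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def chi_square_p_value_1df(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return erfc(sqrt(statistic / 2))


# Hypothetical counts: rows are somatic and germline C > A SNVs,
# columns are SNVs at the motif and SNVs at any other site.
somatic_at_motif, somatic_elsewhere = 300, 700
germline_at_motif, germline_elsewhere = 150, 850

toy_or = odds_ratio(somatic_at_motif, somatic_elsewhere,
                    germline_at_motif, germline_elsewhere)
toy_chi2 = chi_square_2x2(somatic_at_motif, somatic_elsewhere,
                          germline_at_motif, germline_elsewhere)
toy_p = chi_square_p_value_1df(toy_chi2)
```
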

Exome-biased mutational signature in GC

Unlike the MSI and ROS/RNS signatures that were pre- 
sent in coding and non-coding regions of the genome, we 
also detected a third GC mutational signature only evident 
in coding regions (Figure 3b), characterized by an excess 
of C > T mutations. These mutations were in excess at
CpG (NGCII082, OR = 1.2, P-value < 10⁻¹⁶, χ² test) and
GpC (NGCII092, OR = 1.4, P-value < 10⁻¹⁶, χ² test)
dinucleotides. The CpG alterations likely represent deamination
of methylated cytosines followed by errors associated
with transcription-coupled repair, which has also
been observed in other cancers [2,4]. However, the latter 
bias towards C > T alterations occurring at GpC motifs 
appears to be a unique feature not previously reported in 
other cancers [2,4] and could represent deamination due 
to enzymes such as AID (activation-induced cytidine dea- 
minase) [15]. AID is known to preferentially target tran- 
scribed regions [16] and is aberrantly activated due to H. 
pylori infection in the gastric epithelium [17]. Taken collectively,
our whole-genome sequencing data implicate a
minimum of three mutational signatures present in GC 
genomes, related to the presence of MSI, ROS/RNS, and 
deamination processes. 

Figure 2 Map of somatic alterations in two gastric cancer genomes. The Circos plots depict the following information in order from outer
to inner rings: using WGS data (1) CNVs (gain in red capped at 10 copies and loss in gray), (2) indel density (indel frequency per 10 kbp in blue,
capped at 5 indels/10 kbp), (3) SNV density (SNV frequency per 10 kbp in black, each ring is 5 SNVs/10 kbp, capped at 10), and using DNA-PET
data, (4) deletions (in red), tandem duplications (green) and inversions (purple), (5) intra- and (6) inter-chromosomal insertions (orange) and
unpaired SVs (gray).

[Figure 3 plots: panels (a), (b) and (d) show NGCII082 and NGCII092 with germline and somatic variants as separate series; the x-axes list the SNV classes C>A, C>G, C>T, T>A, T>C and T>G in (a) to (c) and the size of indels in bp in (d); panel (c) compares H. pylori-negative and H. pylori-positive tumors.]

Figure 3 Genome-wide and exome-wide mutational fingerprint. (a) Frequency of various classes of somatic SNVs genome-wide. (b) Frequency of
somatic SNVs exome-wide. (c) Mutational bias as a function of infection status using data from 34 exomes (bias for SNV class i was computed as
(s_i - g_i)/g_i, where s_i and g_i are the somatic and germline SNV frequencies). Note that nearly identical results were obtained when MSI tumors were
excluded from the analysis (*P-value < 0.1; **P-value < 0.01). (d) Size-distribution of germline and somatic indels genome-wide.

To further characterize the mutational signatures, we re-analyzed
a total of 40 GC exomes, combining data from
earlier studies [6,7] with two new exomes in this study
(Materials and methods; Table S8 and Figure S7 in Addi- 
tional file 1). Specifically, a comparison of somatic and 
germline frequencies for the exomes showed that all but 
one patient had a significant excess of C > A (ROS/RNS- 
related) or C > T (deamination-related) alterations and 23 
GCs (> 50%) had an excess of both mutations (Fisher's 
exact test P-value < 0.01), establishing these two muta- 
tional classes as the most significant single-nucleotide 
alterations in GC. These patterns were independent of his- 
tological subtype (intestinal, diffuse and mixed-type) and 
MSI status (the excess is also seen in all but one non-MSI 
tumor). Moreover, the frequencies of C > T and C > A 
mutations were significantly different in GCs with active 
H. pylori infection compared to those lacking active infec- 
tion (Wilcoxon rank sum test P-value < 0.006 and 0.06, 
respectively; Figure 3c). Overall, these results support the 
widespread role of ROS/RNS-associated C > A and deami- 
nation-associated C > T mutations in gastric cancer and 
are suggestive of their link to H. pylori infection. 
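
The mutational bias used for this comparison, (s_i - g_i)/g_i for SNV class i, can be sketched in Python as follows. We assume here that s_i and g_i are the fractions of somatic and germline SNVs falling in class i; the function names and the toy counts are hypothetical and do not reproduce data from this study.

```python
SNV_CLASSES = ("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")


def class_frequencies(counts):
    """Convert per-class SNV counts into fractions of the total."""
    total = sum(counts[snv_class] for snv_class in SNV_CLASSES)
    if total == 0:
        raise ValueError("no SNVs to compute frequencies from")
    return {snv_class: counts[snv_class] / total for snv_class in SNV_CLASSES}


def mutational_bias(somatic_counts, germline_counts):
    """Return (s_i - g_i) / g_i for each SNV class i."""
    s = class_frequencies(somatic_counts)
    g = class_frequencies(germline_counts)
    return {c: (s[c] - g[c]) / g[c] for c in SNV_CLASSES}


# Hypothetical counts for a single exome (illustrative only).
toy_germline = {"C>A": 100, "C>G": 100, "C>T": 400, "T>A": 100, "T>C": 250, "T>G": 50}
toy_somatic = {"C>A": 30, "C>G": 5, "C>T": 40, "T>A": 10, "T>C": 10, "T>G": 5}
toy_bias = mutational_bias(toy_somatic, toy_germline)
```

A positive value indicates that a class is over-represented among somatic SNVs relative to germline SNVs, and a negative value indicates depletion.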

A strong signature for transcription-coupled repair has
been described before in other cancers [2,4] and our analy- 
sis also confirmed this in GC, in that poorly transcribed 
regions of the genome were associated with significantly 
more mutations (Figure S8 and Note 8 in Additional file 
1). However, in contrast with earlier reports, we did not 
see a significant bias for mutations in the transcribed ver- 
sus non-transcribed strand in most mutational classes 
(except for T > G, P-value < 0.05, χ² test; Figure S8 in
Additional file 1). The absence of this latter pattern may 
be a consequence of the higher mutational burden from 
mutagens that also act in a transcription-coupled fashion 
(for example, AID [16]). 

Impact of mutational signatures on genes in GC 

The overall impact of the mutational signatures identified 
here on gastric tumorigenesis is a complex question influ- 
enced by several factors, including the nature of muta- 
tions, the function of genes that are frequently impacted 
as well as genetic background and selection processes. We 
aimed to provide an initial assessment using two 
approaches: (i) by characterizing the proportion of genes 
affected by various mutational classes; and (ii) by identify- 
ing recurrently mutated genes in subtypes of GC defined 
by mutational processes. 

Overall, a majority of mutated genes in NGCII082 were 
due to SNVs (77%) while CNVs and SVs played a domi- 
nant role in NGCII092 (82%) (Table 1). In total, we identi- 
fied 107 SVs that affected genes by truncation, fusion, 
deletion, tandem duplication or rearrangements within the 
gene body. Ninety-six (90%) of these were identified in
NGCII092, the tumor exhibiting the CIN phenotype, illustrating
the genic burden from this mutational process. In contrast,



small insertions and deletions (indels) were seen in few 
genes, even in the tumor with MSI phenotype (despite 
indels being roughly as common as SNVs genome-wide; 
Table 1), though their ability to cause frameshifts is likely 
to impact gene function more often than SNVs. Among 
SNVs, even though the deamination-related C > T signa- 
ture is only seen in a small fraction of the genome, it plays 
a larger role in GC due to its targeted impact on genes. 
More than 48% of the non-synonymous mutations seen 
(48% in NGCII092 and 59% in NGCII082) in the two 
tumors were due to C > T mutations, compared to less 
than 19% for C > A mutations (Table 1). Among recur- 
rently mutated genes in GC (Table S7 in Additional file 1 
and Table S9 in Additional file 3), non-synonymous muta- 
tions in the tumor suppressor genes TP53 (mutated in 
50% of samples) and PTEN (18% of samples), and onco- 
genes PIK3CA (13%; 8% have PTEN and PIK3CA muta- 
tions) and CTNNB1 (10%) were often in the form of C > 
T mutations (29%). This was also seen in several novel 
recurrently mutated genes such as AQP7, SPTA1 and 
RP1L1 (mutated in > 10% of tumors; Table S7 in Addi- 
tional file 1). 

Pathway analysis of mutated genes revealed that the 
two most enriched sets were β1-integrin mediated cell-surface
interactions and signaling events mediated by
class III histone deacetylases, a refinement of previous 
analysis [7] (Table S10 in Additional file 4). Furthermore, 
we identified genes implicated in RAC1 regulation to be 
mutated in 83% of H. pylori positive samples (P-value
< 0.05, Fisher's exact test). RAC1 is a member of the Rho
GTPase family known to play diverse oncogenic roles 
[18], shown to regulate the H. pylori virulence factor 
VacA, and known to promote vacuole formation in 
epithelial cells [19]. Mutations in the RAC1 pathway 
could thus simultaneously promote H. pylori infection as 
well as gastric tumorigenesis. 

Finally, to further characterize the impact of mutational 
processes on genes in GC, we considered two specific 
subtypes for identifying recurrently mutated genes, MSI-positive
GC and TP53-wild-type GC (Tables S11 and S13
in Additional file 1 and Table S12 in Additional file 5).
We used TP53-wild-type status as a surrogate marker for
tumors without the CIN phenotype as TP53 is known to
suppress chromosomal instability [20]. In this class of
GCs, in addition to the tumor suppressor gene PTEN
and TTK that interact with TP53, we identified PAPPA, a
marker for pregnancies with aneuploid fetuses [21], as 
being recurrently mutated (Table S13 in Additional file 1; 
note that the average mutation rate for the whole- 
genome sequencing (WGS) samples in an approximately 
2 Mbp window surrounding PAPPA is similar to the 
genome-wide rate, that is 5.3 versus 5.2 mutations/Mbp). 
A screen of an additional 94 gastric cancer/normal pairs 
confirmed the frequency of PAPPA mutations as being 6%
among all GC samples (Table S12 in Additional file 5) and
20% among TP53 wild-type GCs (with mutations in key
functional domains; Figures S13 and S14 in Additional 
file 1), highlighting it as a potential driver gene in this 
subtype. 

In MSI-positive GCs, ACVR2A, RPL22, LMAN1, and 
STAU2 were observed to have recurrent single base thymine
deletions in poly(T) regions (Table S11 in Additional
file 1) and this was confirmed in a screen of an additional 
94 gastric cancer/normal paired samples (9 MSI-positive; 
Table S12 in Additional file 5 and Figure S9 and Note 9 in 
Additional file 1). In total, ACVR2A was mutated in a 
region of 8 thymines in 86% of MSI-positive GC tumors,
RPL22 in a region of 8 thymines in 64%, LMAN1 in a 
region of 9 thymines in 50% and STAU2 in a region of 8 
thymines in 29%. Based on the average frequency of muta- 
tions in homopolymer regions in the MSI-positive tumors 
(4.5% of 8 thymine stretches (n = 778) and 4.8% of 9 thymine
stretches (n = 183), respectively, in exomic regions),
mutations in ACVR2A, RPL22 and LMAN1 were in signifi- 
cant excess (Bonferroni-corrected P-value < 0.0003, exact 
binomial test). In each gene, all the deletions occurred in 
the same homopolymer tract containing thymines, a pattern
linked to the MSI phenotype, and none of the MSI-negative
GC tumors carried these mutations. In contrast,
mutations in the recently reported MSI-associated putative
driver gene ARID1A were not restricted to deletions or 
MSI-positive tumors [7]. Interestingly, ACVR2A (encoding 
a TGF-β super-family differentiation factor) has been
described to be recurrently mutated in MSI-positive color- 
ectal cancer [22]. Also, the frequency of mutations seen 
here is comparable to the previously reported frequency 
in MSI-positive colorectal cancer [23,24] and emphasizes 
the importance of ACVR2A and TGF-β signaling in MSI-positive
GC, while unraveling the oncogenic roles of
RPL22 and LMAN1 requires further investigation. 
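
The exact binomial test referred to above asks how likely it is to see a given homopolymer tract mutated in at least k of n MSI-positive tumors if each tumor mutates it at the background rate. The Python sketch below computes that upper-tail probability. The background rate of 4.5% for 8-thymine tracts comes from the text, but the denominator of 14 tumors is our inference from the reported percentages (86%, 64%, 50% and 29% match 12, 9, 7 and 4 of 14) and is not stated explicitly; the multiple-testing factor is also not given here, so the Bonferroni helper takes it as a caller-supplied assumption.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def bonferroni(p_value, num_tests):
    """Bonferroni-corrected P-value; num_tests must be chosen by the analyst."""
    return min(1.0, p_value * num_tests)


ASSUMED_MSI_TUMORS = 14       # inferred denominator, not stated in the text
BACKGROUND_RATE_8T = 0.045    # 4.5% of 8-thymine stretches mutated (from the text)

# Uncorrected tail probabilities under the assumptions above.
p_acvr2a = binomial_upper_tail(12, ASSUMED_MSI_TUMORS, BACKGROUND_RATE_8T)
p_stau2 = binomial_upper_tail(4, ASSUMED_MSI_TUMORS, BACKGROUND_RATE_8T)
```

Under these assumptions the uncorrected probability for 12 of 14 tumors is many orders of magnitude smaller than that for 4 of 14, in line with STAU2 not being reported among the genes in significant excess.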

Discussion 

Until long read sequencing of several kilo-base pairs is 
routine, the combination of SR and long fragment mate- 
pair sequencing remains the most powerful approach to 
comprehensively capture micro- and macro-scale altera- 
tions in the cancer genome. The combination of SR and 
DNA-PET sequencing in this study thus provides the 
first comprehensive assessment of somatic alterations in 
GC. In particular, our results highlight the importance 
of whole-genome analysis for reconstructing the lineage 
of complex somatic structural variants and characterizing
mutational processes and their genomic impact in
cancer. For example, while point mutations in the KRAS 
gene have been well characterized, our whole-genome 
analysis enabled the first detailed reconstruction of 
amplification in the KRAS locus (a common event in 



GC) and a concomitant deletion of a proposed tumor 
suppressor gene RASSF8. 

The analysis of several exome-sequencing datasets in 
earlier studies [6,7] was able to provide only a limited view 
of mutational processes in GC. Whole-genome analysis 
was essential for providing sufficient detail and statistics to 
identify the features and relative impact of the various 
mutational processes (for example, MSI, ROS/RNS and 
CIN). This is best exemplified by the identification of a
uniquely localized, deamination-linked mutational finger- 
print whose significance would have been missed in an 
exome-based study. We further characterized the impact 
of this mutational process and identified the recurrently 
mutated genes PAPPA, ACVR2A, RPL22, LMAN1, 
and STAU2 in subtypes of GC defined by mutational 
processes. 

Conclusions 

While computational tools for de novo cancer genome 
assembly are limited, its utility is demonstrated by our 
reconstruction of the H. pylori strain genome and assem- 
bly-based characterization of SVs and fusion genes at the 
base pair level. As sequencing costs continue to drop, 
whole-genome sequencing and assembly of affected tissues 
can serve as a tool for biomarker and pathogen discovery 
in cancer and other diseases. Assembly tools need to be 
refined to address the twin challenges of genomic amplifi- 
cations and mixed cell populations and the availability of 
whole-genome SR and DNA-PET data from the clinical 
samples in this study should serve as a useful resource in 
this effort. 

Materials and methods 

Patient samples and clinical information 

Patient samples and clinical information on tissue and 
blood samples were obtained from patients who had 
undergone surgery for gastric cancer at the National 
University Hospital, Singapore, and Tan Tock Seng Hospi- 
tal, Singapore. Informed consent was obtained from all 
subjects and the study was approved by the Institutional 
Review Board of the National University of Singapore 
(reference code 05-145) as well as the National Healthcare 
Group Domain Specific Review Board (reference code 
2005/00440). Clinical information for the two patients 
whose samples were analyzed by whole-genome sequencing
is provided in Table S1 in Additional file 1 and
additional information for the 94 gastric tumors used for 
targeted screening is provided in Table S12 in Additional 
file 5. 

Library preparation and sequencing 

For WGS, genomic DNA isolated from tumor
and blood samples was randomly fractionated using a 
Roche Nebulizer following the manufacturer's instructions 
(Madison, Wisconsin, USA). Fractionated DNA was then
end-repaired, A-tailed at the 3' end, ligated with Illumina 
paired end adaptors, PCR amplified followed by gel-selec- 
tion of a range of 400 to 600 bp fragments as templates 
and sequenced by Illumina GA from both ends to obtain 
76 or 101 bp reads at each end (Table S2 in Additional file 
1). DNA-PET libraries were constructed as described else- 
where [9] and were sequenced by the Applied Biosystems 
SOLiD system (Carlsbad, California, USA, Table S3 in 
Additional file 1). Exome sequencing was performed as 
described earlier using SureSelect Human All Exon Kit v1
(Agilent Technologies, Santa Clara, California, USA) and
sequencing on two lanes of Illumina GA-IIx sequencer
using 76 bp paired-end reads [6].

Mapping and variant calling 

Paired-end Illumina reads were mapped to the reference 
human genome (UCSC hg18) using ELAND (Illumina
Inc.) and reads that failed pass-filter were removed from
further analysis. SNVs and indels were called for each
sample separately using SAMtools [25] (v0.1.7-6, SNP-quality
threshold = 20, consensus-quality threshold = 30)
(Table S4 in Additional file 1). Identical variant calls in 
tumor and matched normal samples were used to iden- 
tify germline variants. Variant calls unique to the tumor, 
where the normal genotype called by SAMtools was dif- 
ferent and where less than two reads of the variant geno- 
type were seen in the normal sample, provided the list of 
somatic variants. Illumina reads from exome sequencing 
were analyzed using this pipeline after BWA [26] map- 
ping (Table S8 in Additional file 1). As a control, we 
noted that germline SNV frequencies were nearly identi- 
cal across all exomes from WGS and exome sequencing 
datasets (Figure S7 in Additional file 1). Somatic SNV 
frequencies and neighborhoods were compared to germ- 
line frequencies to assess enrichment. A neighborhood of 
up to 2 bp surrounding an SNV was used to identify 
enriched motifs. Somatic indel calls were required to be 
supported by at least 20% of the reads, by reads on both 
strands, with a minimum of 10 reads overlapping the 
position in the tumor and no indel calls in the normal 
sample. Somatic SNVs and indels in protein-coding 
regions and introns were confirmed by Sanger sequen- 
cing to have a high validation rate (83 SNVs, validation 
rate = 90%; 72 indels, validation rate = 96%). SNV neigh- 
borhood analysis was done by extracting 5 bp sequences 
upstream and downstream of mutations. Germline and 
somatic copy number variants were identified using the 
program RDXplorer [27] with default parameters. 
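
The somatic filters described above can be summarized in a short Python sketch. It illustrates the stated thresholds only and is not the SAMtools-based pipeline itself: the `IndelEvidence` record and both function names are hypothetical, and we assume that "supported by at least 20% of the reads" refers to reads overlapping the position in the tumor.

```python
from dataclasses import dataclass


@dataclass
class IndelEvidence:
    tumor_depth: int          # reads overlapping the position in the tumor
    tumor_support_fwd: int    # indel-supporting reads on the forward strand
    tumor_support_rev: int    # indel-supporting reads on the reverse strand
    called_in_normal: bool    # True if any indel was called in the normal sample


def is_somatic_indel(ev, min_fraction=0.20, min_depth=10):
    """Apply the indel criteria: depth, read support, both strands, absent in normal."""
    support = ev.tumor_support_fwd + ev.tumor_support_rev
    return (
        ev.tumor_depth >= min_depth
        and support >= min_fraction * ev.tumor_depth
        and ev.tumor_support_fwd > 0
        and ev.tumor_support_rev > 0
        and not ev.called_in_normal
    )


def is_somatic_snv(tumor_genotype, normal_genotype, variant_reads_in_normal):
    """Tumor-only call: genotypes differ and fewer than two variant reads in normal."""
    return tumor_genotype != normal_genotype and variant_reads_in_normal < 2
```

For example, an indel seen in 7 of 30 tumor reads on both strands and absent from the normal sample passes, whereas the same support confined to one strand does not.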

DNA-PET tags were mapped individually to the reference
human genome (UCSC hg18) in color space allowing
two color code mismatches per tag by the SOLiD System 
Analysis Pipeline Tool Corona Lite (Applied Biosystems 
Inc.). Contigs of the reference sequence with unresolved 



location (random chr) and alternative MHC haplotypes 
were excluded from the reference for mapping. Individu- 
ally mapped tags were paired by Corona Lite. In cases 
where one or both tags had multiple mapping locations, a 
process termed 'rescuing' favored the creation of concor- 
dant PETs (both tags are on the same chromosome, same 
strand, same orientation, correct 5' → 3' order and in the
expected distance to each other). 
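
The definition of a concordant PET can be expressed as a simple predicate. The Python sketch below is an illustration under our own assumptions, not the Corona Lite implementation: the `Tag` record and function name are hypothetical, strand and orientation are folded into a single strand field, and the accepted distance window must be supplied by the caller (the values in the usage note are placeholders around the approximately 10 kbp insert size).

```python
from dataclasses import dataclass


@dataclass
class Tag:
    chrom: str
    pos: int       # mapped start coordinate of the tag
    strand: str    # "+" or "-"


def is_concordant(tag5, tag3, min_span, max_span):
    """True if the 5' and 3' tags map as expected for an unrearranged fragment."""
    if tag5.chrom != tag3.chrom or tag5.strand != tag3.strand:
        return False
    # On the forward strand the 5' tag precedes the 3' tag; on the reverse
    # strand the order is mirrored. A wrong order gives a negative distance.
    if tag5.strand == "+":
        distance = tag3.pos - tag5.pos
    else:
        distance = tag5.pos - tag3.pos
    return min_span <= distance <= max_span
```

With a placeholder window of 8,000 to 12,000 bp, `is_concordant(Tag("chr12", 1000, "+"), Tag("chr12", 11000, "+"), 8000, 12000)` holds, while swapping the two tags or placing them on different chromosomes yields a non-concordant PET.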

SVs, based on clusters of non-concordant PETs, were 
called using the GIS DNA-PET pipeline [9] with refined 
quality control criteria: (i) PET clusters of size < 6 were 
excluded; (ii) the regions to which the 5' and 3' tags of 
a cluster mapped had to be at least 1 kbp in size each; 
(iii) PET clusters that had a supercluster (connected 
component of overlapping clusters [9]) size > 100 
required a higher cluster size of 10; and (iv) PET clus- 
ters with high sequence similarity between the two 
fused regions (BLAST score > 2,000 for 20 kbp windows 
around the predicted break points) were excluded. To 
distinguish between germline and somatic SVs, paired 
normal and tumor samples were compared as described 
previously [9]. Further filtering of known germline SVs 
and PCR validation are described in Note 1 in Addi- 
tional file 1. 
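
Quality control criteria (i) to (iv) can be summarized as a single predicate over a PET cluster. The sketch below is illustrative only: the `PetCluster` record and its fields are hypothetical, and criterion (iii) is read as requiring a cluster size of at least 10 when the supercluster size exceeds 100.

```python
from dataclasses import dataclass


@dataclass
class PetCluster:
    # Hypothetical record; field names are illustrative only.
    size: int               # number of PETs in the cluster
    region5_len: int        # length (bp) of the region covered by 5' tags
    region3_len: int        # length (bp) of the region covered by 3' tags
    supercluster_size: int  # size of the containing supercluster
    blast_score: float      # similarity of 20 kbp windows at the break points


def passes_cluster_qc(c):
    """Return True if the cluster survives criteria (i) to (iv)."""
    if c.size < 6:
        return False  # (i) cluster too small
    if c.region5_len < 1000 or c.region3_len < 1000:
        return False  # (ii) each tag region must span at least 1 kbp
    if c.supercluster_size > 100 and c.size < 10:
        return False  # (iii) stricter size threshold in large superclusters
    if c.blast_score > 2000:
        return False  # (iv) fused regions are too similar to each other
    return True
```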

Cancer genome assembly 

Contig assembly, scaffolding and gap-filling of the Illu- 
mina sequencing data were done using the assembler 
SOAPdenovo [28]. DNA-PET reads were mapped to the 
SOAPdenovo assembly with Bowtie [29] and the result- 
ing linking information was used to produce larger scaf- 
folds based on the optimal scaffolder Opera [30]. 
Scaffolds and contigs were refined further with the gap- 
filling module in SOAPdenovo, employed for bridging 
scaffold gaps, where feasible. Using the SR reads alone,
we obtained a scaffold N50 of 12 kb for both tumors. The
DNA-PET reads improved assembly connectivity to an N50
of 65 kb and 41 kb for NGCII082 and NGCII092,
respectively. Assemblies were compared
to the reference human genome (UCSC hg18) using the
MUMmer package [31] and alignments longer than 1 
kbp were used to identify deletions and insertions larger 
than 20 bp. Overall, 12,861 deletions and 143 insertions
were found in NGCII082 and 9,274 deletions and 108
insertions in NGCII092; of these, three events > 2 kbp
that were missed by DNA-PET analysis were identified in each
sample. Fusion genes were validated and breakpoints
were confirmed by using the gap-filling module in 
SOAPdenovo to bridge scaffolds constructed around the 
breakpoint. Sequences missing in the reference human 
genome were identified based on the criteria that they 
should be > 500 bp long and have no match to the 
reference genome with > 90% identity. Reads were 
mapped to the novel sequences using Bowtie to identify 
regions with no read coverage in the middle of a scaffold
that could indicate a potential mis-assembly.
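
For reference, the scaffold N50 values quoted above follow the usual definition: the length L such that scaffolds of length L or longer contain at least half of the assembled bases. A minimal sketch of that computation, not taken from any of the tools named here, is:

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold or contig lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
```

For lengths 2, 3, ..., 10 (54 bases in total), the three longest scaffolds cover 27 bases, so the N50 is 8.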

Analysis of microbial sequences 

Reads with a putative microbial or viral origin were identified
by mapping reads that did not map to the human genome
to a database of complete bacterial and viral genomes
in NCBI (using Bowtie [29]). Matches were filtered
for low-complexity sequences (more than three matches 
of any 5-mer) and the remaining reads were used to esti- 
mate the abundance for each species (pooling reads 
mapped to different strains of a species). Each species was 
checked for multiple distinct read matches to its genome 
(> 4 distinct regions, where the genome was segmented in 
1 kbp windows) and the presence of unique read matches 
(using the unique option in Bowtie). The small fraction of
reads of putative bacterial origin in the matched blood
samples (possibly reagent contamination) was used as a
control, and read matches to the corresponding species
were excluded in determining the tumor-associated microbiome.
The concentration of H. pylori cells relative to
tumor cells was estimated assuming
uniform coverage of both cell types, where coverage = k ×
number of cells × size of genome for a constant k, and
both populations are assumed to be clonal.
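
Two of the quantitative steps in this analysis can be sketched briefly. Both functions are illustrative and rest on stated assumptions. The low-complexity rule is read as "some 5-mer occurs more than three times in the read". The cell ratio follows from the coverage relation by cancelling the constant k, with "coverage" read as the total number of sequenced bases assigned to each genome. All function and parameter names are hypothetical.

```python
from collections import Counter


def is_low_complexity(read, k=5, max_count=3):
    """True if any k-mer occurs more than max_count times in the read."""
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    return any(n > max_count for n in counts.values())


def cells_per_tumor_cell(microbe_bases, microbe_genome_size,
                         human_bases, human_genome_size):
    """Estimate microbial cells per tumor cell.

    From coverage = k * cells * genome_size for each cell type,
    cells is proportional to coverage / genome_size, so k cancels
    in the ratio of the two cell counts.
    """
    microbe_cells = microbe_bases / microbe_genome_size
    human_cells = human_bases / human_genome_size
    return microbe_cells / human_cells
```

Under these assumptions, equal per-genome depth for the two organisms corresponds to one microbial cell per tumor cell.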

Functional annotation of SNVs and indels 

For all samples, SNV and indel calls were annotated using 
the SeattleSeq server [32] and SIFT [33], respectively. Path- 
way analyses were performed based on non-synonymous 
SNVs and indels using the Pathway Interaction Database 
[34] (sample pfg005T from Wang et al. [7] was excluded as 
it only had four somatic mutations). 

Data access 

Sequencing data for this publication have been deposited
in NCBI's Gene Expression Omnibus [35] and are accessible
through GEO Series accession number GSE30833.

Additional material 